Failure Classification and Inference in Large-Scale Systems: A Systematic Study of Failures in PlanetLab

Authors

  • Sourabh Jain
  • Rohini Prinja
  • Abhishek Chandra
  • Zhi-Li Zhang
Abstract

Large-scale distributed systems are prone to frequent failures, which can be caused by a variety of factors related to network, hardware, and software problems. Any downtime due to failures, whatever the cause, can lead to large disruptions and huge losses. Identifying the location and cause of a failure is critical for the reliability and availability of such systems. However, identifying the actual cause of failures in such systems is a challenging task due to their large scale and the variety of failure causes. In this work, we try to understand failures in a large-scale system through a two-step methodology: (i) classifying failures based on their statistical properties, and (ii) using additional monitoring data to explain these failures. We illustrate our methodology through a systematic study of failures in PlanetLab over a 3-month period. Our results show that most of the failures that required restarting a node were of small size and lasted for long durations. We also found that incorporating geographic information into our analysis enabled us to find site-wise correlated failures. We were also able to explain some failures by using error-message information collected by the monitoring nodes, and some short-lived failures by transient CPU overloads on machines.
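The first step of the methodology described in the abstract, classifying failures by size and duration and then looking for site-wise correlation, can be illustrated with a minimal sketch. This is not the authors' implementation: the `NodeFailure` record, the 60-second grouping window, and the size and duration cutoffs are all hypothetical values chosen only for illustration, and the abstract does not say how the paper defines them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeFailure:
    """One node outage (hypothetical record layout, not from the paper)."""
    node: str
    site: str
    start: float  # seconds since the start of the trace
    end: float


def group_concurrent(failures, window=60.0):
    """Group outages whose start is within `window` seconds of the previous start.

    The window is an assumed parameter for this sketch.
    """
    groups = []
    for f in sorted(failures, key=lambda f: f.start):
        if groups and f.start - groups[-1][-1].start <= window:
            groups[-1].append(f)
        else:
            groups.append([f])
    return groups


def classify(group, size_cutoff=3, duration_cutoff=3600.0):
    """Label a group by size (node count) and duration (longest outage).

    Both cutoffs are assumptions, not thresholds taken from the paper.
    """
    size = "small" if len(group) < size_cutoff else "large"
    longest = max(f.end - f.start for f in group)
    duration = "short" if longest < duration_cutoff else "long"
    return size, duration


def site_correlated(group):
    """Return the sites that have at least two distinct failed nodes in the group."""
    by_site = defaultdict(set)
    for f in group:
        by_site[f.site].add(f.node)
    return sorted(site for site, nodes in by_site.items() if len(nodes) >= 2)


if __name__ == "__main__":
    sample = [
        NodeFailure("n1", "site-a", 0.0, 7200.0),
        NodeFailure("n2", "site-a", 30.0, 90.0),
        NodeFailure("n3", "site-b", 40.0, 100.0),
        NodeFailure("n4", "site-b", 5000.0, 5060.0),
    ]
    for group in group_concurrent(sample):
        print(classify(group), site_correlated(group))
```

Grouping by start time is only one plausible way to define a failure event. The second step of the methodology, explaining failures with error messages or CPU load data, would join each group against additional monitoring records and is not sketched here.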


Similar resources

MTBF evaluation for 2-out-of-3 redundant repairable systems with common cause and cascade failures considering fuzzy rates for failures and repair: a case study of a centrifugal water pumping system

In many cases, redundant systems are beset by both independent and dependent failures. Ignoring dependent variables in MTBF evaluation of redundant systems hastens the occurrence of failure, causing it to take place before the expected time, hence decreasing safety and creating irreversible damages. Common cause failure (CCF) and cascading failure are two varieties of dependent failures, both l...


Implementation of Traditional (S-R)-Based PM Method with Bayesian Inference

In order to perform Preventive Maintenance (PM), two approaches have evolved in the literature. The traditional approach is based on the use of statistical and reliability analysis of equipment failure. Under statistical-reliability (S-R)-based PM, the objective of achieving the minimum total cost is pursued by establishing fixed PM intervals, which are statistically optimal, at which to replac...


Subtleties in Tolerating Correlated Failures

High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today’s wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using a combination of experimental and mathematical analysis of several real-world fa...


A One-Stage Two-Machine Replacement Strategy Based on the Bayesian Inference Method

In this research, we consider an application of Bayesian inference to the machine replacement problem. The application is concerned with the time to replace two machines producing a specific product, each machine performing a special operation on the product, when there are manufacturing defects because of failures. A common practice for this kind of problem is to fit a single distribution to the co...


Common architecture for distributed probabilistic Internet fault diagnosis

This thesis presents a new approach to root cause localization and fault diagnosis in the Internet based on a Common Architecture for Probabilistic Reasoning in the Internet (CAPRI) in which distributed, heterogeneous diagnostic agents efficiently conduct diagnostic tests and communicate observations, beliefs, and knowledge to probabilistically infer the cause of network failures. Unlike previo...




Publication date: 2008